Optimizing CRF-Based Model for Proper Name Recognition in Polish Texts

نویسندگان

  • Michal Marcinczuk
  • Maciej Janicki
چکیده

In this paper we present several optimizations introduced to Conditional Random Fields-based model for proper names recognition in Polish running texts. The proposed optimizations refer to word-level segmentation problems, gazetteers incompleteness, problem of unambiguous generalization features, feature construction and selection, and finally recognition of common proper names on the basis of external sources of knowledge. The problem of proper name recognition is limited to recognition of person first names and surnames, names of countries, cities and roads. The evaluation is performed in two ways: a single domain evaluation using 10-fold cross validation on a Corpus of Stock Exchange Reports and a cross-domain evaluation on a Corpus of Economic News. An additional corpus of Wikipedia articles, namely InfiKorp is used in the feature selection. Finally, we evaluate three configurations of proposed modifications. The top configuration improved the final result from 94.53% to 95.65% of F-measure for single domain and from 70.86% to 79.63% for cross-domain evaluation.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

تشخیص اسامی اشخاص با استفاده از تزریق کلمه‌های نامزد اسم در میدان‌های تصادفی شرطی برای زبان عربی

Named Entity Recognition and Extraction are very important tasks for discovering proper names including persons, locations, date, and time, inside electronic textual resources. Accurate named entity recognition system is an essential utility to resolve fundamental problems in question answering systems, summary extraction, information retrieval and extraction, machine translation, video interpr...

متن کامل

Resources for Information Extraction from Polish texts

The paper presents a collection of resources developed for Information Extraction (IE) from Polish texts. In particular, we mention two IE platforms adapted to Polish and several IE applications built on top of one of them: named entity recognition, creation of terminology lexicons, and data extraction from medical texts.

متن کامل

Recognition of Named Entities in Spanish Texts

Proper name recognition is a subtask of Name Entity Recognition in Message Understanding Conference. For our corpus annotation proper name recognition is a crucial task since proper names appear approximately in more than 50% of total sentences of the electronic texts that we collected for such purpose. Our work is focused on composite proper names (names with coordinated constituents, names wi...

متن کامل

Automatic Detection and Classification of Phoneme Repetitions Using Htk Toolkit

The therapy of stuttering people is based on a proper selection of texts and then on a practice of their articulation by reading or narration. The texts are chosen on the basis of kind and intensity of dysfluencies appearing in a speech. Thus there is still a requirement to find effective and objective methods of analysis of dysfluent speech. Hidden Markov models are stochastic models widely us...

متن کامل

NEROC: Named Entity Recognizer of Chemicals

We describe a pipeline system, Named Entity Recognizer of Chemicals (NEROC), that aims to identify chemical entities mentioned in free texts. The system is based on a machine learning approach, a Conditional Random Field (CRF), and a selection of feature sets that are used to capture specific characteristics of chemical named entities. In this paper, we report results that produced by CRF model...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012